Toward a Large Spontaneous Mandarin Dialogue Corpus
Abstract
This paper discusses recent results on Mandarin spoken dialogues and introduces the collection of a large Mandarin conversational dialogue corpus. For data processing, it proposes principles of transcription and describes a transcription tool developed specifically for Mandarin spoken conversations.
Related papers
Important and new features with analysis for disfluency interruption point (IP) detection in spontaneous Mandarin speech
This paper presents a set of new features, some duration-related and some pitch-related, for disfluency interruption point (IP) detection in spontaneous Mandarin speech, taking into account the special linguistic characteristics of Mandarin Chinese. A decision tree is incorporated into the maximum entropy model to perform the IP detection. By examining performance degradation when each s...
Mandarin Topic-oriented Conversations
This paper describes the collection and processing of a pilot speech corpus annotated with dialogue acts. The Mandarin Topic-oriented Conversational Corpus (MTCC) consists of annotated transcripts and sound files of conversations between two familiar persons. Particular features of spoken Mandarin, such as discourse particles and paralinguistic sounds, are taken into account in the orthographical...
Automatic generation of pronunciation lexicons for Mandarin spontaneous speech
Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed for Engl...
HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus
The paper describes the design, collection, transcription and analysis of 200 hours of HKUST Mandarin Telephone Speech Corpus (HKUST/MTS) from over 2100 Mandarin speakers in mainland China under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data include 1206 ten-minute natural Mandarin conversations between either stran...
DUEL: A Multi-lingual Multimodal Dialogue Corpus for Disfluency, Exclamations and Laughter
We present the DUEL corpus, consisting of 24 hours of natural, face-to-face, loosely task-directed dialogue in German, French and Mandarin Chinese. The corpus is uniquely positioned as a cross-linguistic, multimodal dialogue resource controlled for domain. DUEL includes audio, video and body tracking data and is transcribed and annotated for disfluency, laughter and exclamations.